-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
near search: add min_interval
#1520
Conversation
Currently, some tests are failing. groonga/lib/mrb/scripts/scan_info.rb Line 10 in 870f426
Line 1236 in 870f426
si->max_interval is always 0 ...
|
This behaviour is caused by my changes, but I can't understand what of my changes does actually affect as I didn't touch |
Hmm... I have idiotically misspelled the function name...
|
That case means these are a same word at that context.
skipped when GRN_OP_ORDERED_NEAR_PHRASE
94a3b0b
to
57fccbe
Compare
Would you please review this? |
min_interval
{"content": "abc123456789def"} | ||
] | ||
[[0,0.0,0.0],1] | ||
select Entries --filter 'content *NP-1,0,12,11"abc def"' --output_columns '_score, content' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, when max_elements_intervals
specified, the interval between abc
and def
for abc123456789def
is regarded as:
11
when checking min_interval
.
12
when checking max interval specified by max_element_intervals
.
This is because when checking max_interval
and min_interval
, the interval is calculated as 11
(
Line 12031 in b50a988
interval -= (data->token_info->n_tokens_in_phrase - 1); |
And when checking
max_element_intervals
and min_interval
, the interval is calculated as 12
. (Line 11766 in b50a988
int32_t interval = pos - previous_pos; |
The diff of these intervals come from we don't subtract tokens in a phrase when calculating interval for checking
max_element_intervals
and min_interval
.
I think this is a bug and I plan to work to fix this #1519.
lib/grn_ecmascript.lemon
Outdated
int min_interval; | ||
GRN_INT32_POP(&efsi->min_interval_stack, min_interval); | ||
grn_expr_append_const_int(efsi->ctx, efsi->e, min_interval, | ||
GRN_OP_PUSH, 1); | ||
n_args++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it correct?
Does this mean if and only if max_element_intervals
is specified, min_interval
must be specified too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intention is:
- Users must also specify
max_element_intervals
in order to specifymin_interval
.- They can omit the actual value of
max_element_intervals
like*NP-1,0,,10 "hoge hoge"
- They can omit the actual value of
- Users can omit
min_interval
. - if this is out of
if (max_element_intervals)
,min_interval
could be the 3rd argument, but I couldn't get any good idea to decide that the 3rd argument ismin_interval
ormax_element_intervals
.- e.g.
*NP-1,0,3 "hoge hoge"
, we can't decide that users intend3
ismin_interval
ormax_element_intervals
.
- e.g.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
min_interval
must be specified too?
As the comment above, we can omit min_interval
in the current implementation.
For example, in the current implementation, if we specify the filter or query as below, min_interval
is the default value INT32_MIN
.
*NP-1,0,2|1 "a 2 c"
(I confirmed that with a debugger.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will modify this as term of readability.
Please wait a while...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We always a value to efsi->min_interval_stack
but this pop conditionally.
Does this work with --filter 'X *N "A B" && X *N,1,2,3 "A B"
? Does the second *N
use DEFAULT_MIN_INTERVAL
not 3
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I confirmed, it works fine. The second *N uses 3.
I have added tests for those cases.
test/command/suite/select/filter/near/min_interval/with_max/match.expected
Outdated
Show resolved
Hide resolved
to GRN_II_DEFAULT_NEAR_MIN_INTERVAL
Would you re-review this? |
Hmm, I noticed that we have no test case for using this with |
Added. |
私も
diff --git a/lib/grn_ecmascript.lemon b/lib/grn_ecmascript.lemon
index 33f2feccd..e8baadacc 100644
--- a/lib/grn_ecmascript.lemon
+++ b/lib/grn_ecmascript.lemon
@@ -399,6 +399,8 @@ relational_expression ::= relational_expression NEAR shift_expression. {
grn_obj *max_element_intervals;
GRN_PTR_POP(&efsi->max_element_intervals_stack,
max_element_intervals);
+ int min_interval;
+ GRN_INT32_POP(&efsi->min_interval_stack, min_interval);
if (!max_element_intervals) {
break;
}
@@ -406,8 +408,6 @@ relational_expression ::= relational_expression NEAR shift_expression. {
GRN_OP_PUSH, 1);
grn_expr_take_obj(efsi->ctx, efsi->e, max_element_intervals);
n_args++;
- int min_interval;
- GRN_INT32_POP(&efsi->min_interval_stack, min_interval);
grn_expr_append_const_int(efsi->ctx, efsi->e, min_interval,
GRN_OP_PUSH, 1);
n_args++; |
あと、この変更の問題ではないんですが、 |
Always pop `min_interval`
なるほど、確認不足でした。。。 |
こちらについては、issue #1528 に起票しましたので、別のタスクとして実施させてください。 |
lib/grn_ecmascript.lemon
Outdated
grn_expr_append_const_int(efsi->ctx, efsi->e, min_interval, | ||
GRN_OP_PUSH, 1); | ||
n_args++; | ||
} while(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} while(0); | |
} while (false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
修正しました。
lib/grn_ii.h
Outdated
@@ -200,6 +202,7 @@ struct _grn_select_optarg { | |||
int additional_last_interval; | |||
grn_obj *query_options; | |||
grn_obj *max_element_intervals; | |||
int *min_interval; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
grn_select_optarg
は公開インターフェイスじゃないのでint
でいい気がします。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
intに直しました。
lib/ii.c
Outdated
return (interval <= | ||
(data->max_interval + data->additional_last_interval) && | ||
interval_without_last_token <= data->max_interval); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
私はカッコが残っていたほうが好みですが、カッコを取りたいなら中途半端に取らないでもっと取りませんか?
return (interval <= | |
(data->max_interval + data->additional_last_interval) && | |
interval_without_last_token <= data->max_interval); | |
return interval <= (data->max_interval + data->additional_last_interval) && | |
interval_without_last_token <= data->max_interval; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ここにmin_interval
を入れていたときに追加した変更で、現在は不要な変更なので、ここの変更自体をリバーとしました。
lib/proc.c
Outdated
@@ -4350,6 +4350,7 @@ selector_in_values(grn_ctx *ctx, grn_obj *table, grn_obj *index, | |||
search_options.similarity_threshold = 0; | |||
search_options.max_interval = 0; | |||
search_options.additional_last_interval = 0; | |||
search_options.min_interval = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
grn_search_optarg
でのメンバーの定義順と同じ順番にしませんか?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
修正しました。
{"content": "abc123456789def"} | ||
] | ||
[[0,0.0,0.0],1] | ||
select Entries --filter 'content *NP12,1,,10"abc def$"' --output_columns '_score, content' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
10じゃなくて11がボーダーじゃないですか?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
おっしゃるとおりですので、11に直しました。
(色々試しているときに10を入れてそのままになっていました)
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
度々のご確認ありがとうございます。ご指摘の事項に対応しました。 |
Add an argument to specify
min_interval
for the near search family.The intervals of phrases(words) should be greater or equals to
min_interval
.The default value of
min_interval
is INT32_MIN(-2147483648
).New syntax:
We should also specify
${MAX_INTERVAL}
,${ADDITIONAL_LAST_INTERVAL}
,${MAX_PHRASE_INTERVAL_1}
( exclude*N
) in order to specify${MIN_INTERVAL}
because it is added as the 4th argument.Example:
In the case above, the interval between
abc
anddef
ofabc123456789def
is11
.So
*NP-1,0,,11
matches and*NP-1,0,,12
doesn't match.